Chromatin Immunoprecipitation followed by Sequencing (ChIP-seq)

  • One of the early applications of NGS

  • First studies published in 2007:

    • Johnson et al. (Science) - NRSF

    • Barski et al. (Cell) - histone methylation

    • Robertson et al. (Nature Methods) - STAT1

    • Mikkelsen et al. (Nature) - histone modification

  • 4000 publications currently in PubMed

APPLICATIONS

  • Protein-DNA interaction

    • Identification of transcription factor binding

  • Histone modifications

Experiment Overview

Image adapted from: Park P (2009), Nature Reviews Genetics, 10, 669-680.

Experiment Overview

  • Resolution: high - single nucleotide
  • Coverage: limited by “alignability” of reads to the genome
  • Cost: around £1000 per lane
  • Source of noise: sequencing bias, GC bias, sequencing error
  • Amount of DNA required: low, 10-50 ng
  • Multiplexing: possible

Image adapted from: Park P (2009), Nature Reviews Genetics, 10, 669-680.

Sample Preparation

Step                 Transcription factor binding  Histone modifications and nucleosome positioning
Crosslinking         Formaldehyde                  Usually not
Fragmentation        Sonication (200-600 bp)       MNase treatment
Immunoprecipitation  Antibody specific to protein  Antibody specific to histone modification or histone

Experimental Design

  • Antibody quality
  • Control experiment
  • Depth of sequencing
  • Multiplexing
  • Paired-end reads

ANTIBODY QUALITY

  • Antibody quality - a sensitive and specific antibody gives a high level of enrichment

    • Limited antibody efficiency is the main reason for failed ChIP-seq experiments
  • Check your antibody ahead of time if possible

    • Western blotting to check the reactivity of the antibody with unmodified and non-histone proteins
  • Optimize ChIP protocol

    • If known positives and negatives are available, perform qPCR to demonstrate enrichment for these regions

The Need for a Control Sample

  • Open chromatin regions are fragmented more easily than closed regions.

  • Repetitive sequences might seem to be enriched (the repeat copy number in the assembled genome can be inaccurate).

  • Uneven distribution of sequence tags across the genome

  • A ChIP-seq peak should be compared with the same region in a matched control

Image adapted from: Rozowsky et al. (2009) Nature Biotechnology, 27:66-75.

CONTROL TYPE

  • Input DNA

  • Mock IP - DNA obtained from IP without antibody

    • Very little material is pulled down, so results from multiple mock IPs are often inconsistent
  • Nonspecific IP - using an antibody against a protein that is not known to be involved in DNA binding

  • Sequencing a control can be avoided when looking at:

    • time points

    • differential binding pattern between conditions

Sequencing Depth

  • Prominent peaks can be identified with fewer reads, whereas weaker peaks require greater depth

  • Number of putative target regions continues to increase as a function of sequencing depth

Image adapted from: Rozowsky et al. (2009) Nature Biotechnology, 27:66-75.

Sequencing Depth

  • For human/mouse, >20 million uniquely mapped reads are usually sufficient with current sequencing technologies

  • HiSeq 4000 - ~300-350 Million reads per lane
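
The two figures above suggest a rough estimate of how many samples can share one lane. The sketch below is only illustrative arithmetic: the unique mapping rate of 0.75 is an assumption, not a number from these slides, and real yields vary between runs.

```python
# Rough multiplexing estimate; every rate below is an illustrative assumption.
reads_per_lane = 300_000_000      # lower end of the ~300-350 million quoted above
target_unique_reads = 20_000_000  # the ">20 million uniquely mapped reads" guideline
unique_mapping_rate = 0.75        # assumed fraction of reads that map uniquely

# Raw reads each sample needs so that enough remain after mapping.
raw_reads_per_sample = target_unique_reads / unique_mapping_rate

# Whole samples that fit in one lane.
samples_per_lane = int(reads_per_lane // raw_reads_per_sample)
print(samples_per_lane)
```

Lowering the assumed mapping rate, or asking for more depth to recover weak peaks, reduces the number of samples per lane accordingly.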

Image adapted from: Rozowsky et al. (2009) Nature Biotechnology, 27:66-75.

Sample barcoding and de-multiplexing

Barcoding

Demultiplexing

  • Read 1
    ATTAGGCCTAAGCA… - Sample A
  • Read 2
    GAGCAACGACTACT… - Sample B
  • Read 3
    ATTAGGCCATACAT… - Sample A
  • Read 4
    CCATAGGCTGACTA… - Sample C
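
As a minimal sketch of the demultiplexing step, the example reads above can be grouped by their leading bases. It assumes the barcode is the first 8 bases of each read (the prefix shared by Reads 1 and 3); the barcode length, the sample sheet and the "Undetermined" bin are assumptions for illustration, and on many platforms the barcode is read separately as an index read.

```python
from collections import defaultdict

# Hypothetical barcode-to-sample sheet; the 8-base barcode length is an assumption.
BARCODES = {"ATTAGGCC": "Sample A", "GAGCAACG": "Sample B", "CCATAGGC": "Sample C"}
BARCODE_LEN = 8

def demultiplex(reads):
    """Group reads by the sample whose barcode matches the read's first bases."""
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:BARCODE_LEN], read[BARCODE_LEN:]
        sample = BARCODES.get(barcode, "Undetermined")
        by_sample[sample].append(insert)  # store the read with its barcode trimmed off
    return dict(by_sample)

reads = ["ATTAGGCCTAAGCA", "GAGCAACGACTACT", "ATTAGGCCATACAT", "CCATAGGCTGACTA"]
groups = demultiplex(reads)
```

This sketch requires an exact barcode match; tolerating a sequencing error in the barcode would need a mismatch-aware lookup.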

Paired-End Sequencing

  • DNA fragments are sequenced from both ends

  • Increases “mappability” - especially in repetitive regions

  • Reduces duplicates

  • Costs twice as much as single-end reads

  • For ChIP-seq, usually not worth the extra cost, unless you have a specific interest in repeat regions

Analysis Workflow Overview

Alignment

Goal: Given a reference sequence and a set of short reads, align each read to the most likely origin of the fragment from which the read came.
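
To make the goal concrete, here is a toy, brute-force sketch that places a read at the reference position with the fewest mismatches. It is not how production aligners work (they search an index of the genome and also handle gaps and both strands); the function name, the mismatch limit and the sequences are all made up for illustration.

```python
def align_read(reference, read, max_mismatches=2):
    """Return (position, mismatches) of the best ungapped placement, or None."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(window, read))
        # Keep the first placement with the lowest mismatch count.
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best

reference = "ACGTTGCAAGGCTTACGATCGA"
hit = align_read(reference, "GGCTAACG")   # one base differs from the reference
miss = align_read(reference, "TTTTTTTT")  # no placement within the mismatch limit
```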

Alignment

Mappability

  • Not all of the genome is ‘available’ for mapping

  • Align your reads to the unmasked genome

Table from: Rozowsky et al. (2009) Nature Biotechnology, 27:66-75.

*Figure calculated based on 30nt sequence tags

  • ChIP-seq usually uses short reads (50 or 100 bp)

Reads can map in multiple locations

  • Some parts of the genome will not be unique:

    • Common, repeated motifs (protein domains)

    • Repeat regions
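
One way to see why repeats limit mappability is to count how many read-length substrings (k-mers) of a sequence occur only once. The sketch below does this for a made-up toy genome; it only considers the forward strand and exact matches, which is a simplification.

```python
from collections import Counter

def unique_kmer_fraction(genome, k):
    """Fraction of k-mer start positions whose sequence occurs exactly once."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    unique = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique / len(kmers)

# Toy genome in which the block "ACGTACGT" appears twice.
genome = "ACGTACGTTTGCAACGTACGT"
short = unique_kmer_fraction(genome, 4)    # short reads: many ambiguous positions
longer = unique_kmer_fraction(genome, 10)  # reads longer than the repeat: all unique
```

In this toy case, reads longer than the repeated block span into unique flanking sequence, so every position becomes uniquely mappable.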

Duplicate reads

  • Reads that align in exactly the same place (same start + same CIGAR string)

  • Duplicates can occur from:

    • Artefacts from library preparation and sequencing (PCR artefacts)

    • Real biological signal

  • We cannot tell the two apart unless we use barcodes.
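
The definition above can be sketched as a simple filter. The tuples below are hypothetical alignments; including the chromosome and strand in the key is an assumption added here on top of the "same start + same CIGAR string" rule, and the choice to keep the first read seen is arbitrary.

```python
# Hypothetical alignments as (chromosome, start, strand, CIGAR) tuples.
alignments = [
    ("chr1", 1000, "+", "50M"),
    ("chr1", 1000, "+", "50M"),    # same place, same CIGAR: duplicate
    ("chr1", 1000, "+", "48M2S"),  # same start but different CIGAR: kept
    ("chr1", 1000, "-", "50M"),    # opposite strand: kept
    ("chr2", 500, "+", "50M"),
]

def mark_duplicates(alignments):
    """Return (kept, duplicates), keeping the first read seen at each position."""
    seen = set()
    kept, duplicates = [], []
    for aln in alignments:
        (duplicates if aln in seen else kept).append(aln)
        seen.add(aln)
    return kept, duplicates

kept, duplicates = mark_duplicates(alignments)
```

Note that this filter cannot distinguish PCR artefacts from genuine signal; it only applies the positional definition.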

Quality Control - Alignment

Quality Control - Enrichment

Peak Calling

  • Basic:

    • Regions are scored by the number of tags in a window of a given size.

    • Then assess by enrichment over control and minimum tag density.
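
The basic approach can be sketched in a few lines. This is an illustration only, not any particular tool: the window size, the thresholds and the pseudocount are assumptions, tags are reduced to single positions on one chromosome, and the ChIP and control libraries are assumed to have comparable depth.

```python
def call_windows(chip_tags, control_tags, genome_len, window=200,
                 min_tags=5, min_fold=3.0):
    """Return (start, end, chip_count, fold) for windows passing both filters."""
    peaks = []
    for start in range(0, genome_len, window):
        end = min(start + window, genome_len)
        chip = sum(start <= t < end for t in chip_tags)
        ctrl = sum(start <= t < end for t in control_tags)
        fold = chip / (ctrl + 1)  # pseudocount of 1 avoids dividing by zero
        # Require both a minimum tag density and enrichment over control.
        if chip >= min_tags and fold >= min_fold:
            peaks.append((start, end, chip, fold))
    return peaks

# Hypothetical tag positions.
chip_tags = [210, 215, 230, 250, 260, 290, 310, 650, 655, 700]
control_tags = [50, 240, 450, 660, 670, 690]
peaks = call_windows(chip_tags, control_tags, genome_len=800)
```

Here only the 200-400 window passes: it has 7 ChIP tags against 1 control tag, while the 600-800 window has too few ChIP tags.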

  • Advanced: take advantage of the directionality of the reads.

Image adapted from: Kharchenko et al. (2008), Nat Biotechnol. 26:1351-1359

Peak Calling - Strand Specific Profile

Image from: Kharchenko et al. (2008), Nat Biotechnol. 26:1351-1359

Peak Calling - Challenges

  • Adjust for sequence alignability - regions that contain repetitive elements have a different expected tag count

  • Different ChIP-seq applications produce different types of peaks. Most current tools have been designed to detect sharp peaks (TF binding, histone modifications at regulatory elements)

  • Alternative tools exist for broader peaks (histone modifications that mark domains - transcribed or repressed), e.g. SICER

Peak Calling - Broad Marks and Sharp Peaks

Image adapted from: Park P (2009), Nature Reviews Genetics, 10, 669-680.

Peak Calling - MACS2

Model the shift size between +/- strand tags:

  • Scan the genome to find regions with tags enriched relative to random tag distribution

  • Randomly sample 1000 of these (high quality peaks) and calculate the distance d between the modes of their +/- peaks

  • Shift all the tags by d/2 toward the 3’ end.
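
A minimal sketch of the shift model, under stated assumptions: tag positions are 5' ends on a single chromosome, the tag lists are made up, and combining the per-peak distances with a median is a choice made here for illustration rather than a description of the MACS2 implementation.

```python
from statistics import median, mode

def estimate_d(peaks):
    """Median distance between the - strand and + strand tag modes."""
    return median(mode(minus) - mode(plus) for plus, minus in peaks)

def shift_tags(plus, minus, d):
    """Move every tag d/2 toward its 3' end, i.e. toward the binding site."""
    half = d // 2
    # + strand tags move right, - strand tags move left.
    return [p + half for p in plus] + [m - half for m in minus]

# Hypothetical peaks as (plus-strand tags, minus-strand tags).
peaks = [
    ([100, 100, 102, 98], [200, 200, 203, 197]),
    ([500, 500, 495], [610, 610, 612]),
    ([900, 900, 905], [1000, 1000, 996]),
]
d = estimate_d(peaks)
shifted = shift_tags(*peaks[0], d)
```

After shifting, the two strand-specific pile-ups of the first peak coincide around position 150, which is the presumed binding site.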

Peak Calling - MACS2

  • Identify candidate peaks using a Poisson distribution to model background

  • Candidate peaks are evaluated by comparing them against a “local” distribution.

  • False Discovery Rate (FDR) is calculated as (#peaks in control) / (#peaks in IP)
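
The three steps above can be sketched numerically. This is a hedged illustration, not the MACS2 code: the tag counts and peak counts are invented, and taking the local background as the maximum over the genome-wide rate and control counts in 5 kb and 10 kb windows is an assumption about how a "local" distribution might be built.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by direct summation."""
    below = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - below

def local_lambda(lam_genome, control_5k, control_10k, window):
    """Largest expected tag count for a `window`-bp region, taken over the
    genome-wide rate and the control tag counts in 5 kb and 10 kb windows."""
    return max(lam_genome,
               control_5k * window / 5_000,
               control_10k * window / 10_000)

window = 200
lam_genome = 2.0  # hypothetical expected tags per 200 bp window, genome-wide
lam = local_lambda(lam_genome, control_5k=150, control_10k=200, window=window)

p_global = poisson_sf(15, lam_genome)  # candidate with 15 tags vs. global background
p_local = poisson_sf(15, lam)          # same candidate vs. local background

fdr = 12 / 480  # (#peaks in control) / (#peaks in IP), hypothetical counts
```

In this made-up case the candidate looks highly significant against the global background but not against the local one, which is the situation the local comparison is meant to catch.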

MACS Peak Detection

  • Remove duplicate tags (in excess of what can be expected by chance)

  • Slide a window across the genome to find candidate peaks with a significant tag enrichment (Poisson distribution, global background, p-value threshold 1e-5)

  • Also looks at local background levels and eliminates peaks that are not significant with respect to local background

  • Uses the control sample to eliminate peaks that are also present there

Analysis Downstream to Peak Calling

  • Visualisation - genome browser: Ensembl, UCSC, IGV

  • Peak Annotation - finding interesting features surrounding peak regions:

    • PeakAnalyzer

    • ChIPpeakAnno (R package)

    • GREAT

    • bedtools

    • PAVIS
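
At its simplest, peak annotation means finding the nearest feature to each peak. The sketch below assigns a peak to the closest transcription start site (TSS) on one chromosome; the gene names and coordinates are hypothetical, and the tools listed above do considerably more (strand, multiple feature types, distance cut-offs).

```python
import bisect

# Hypothetical gene TSS coordinates on one chromosome, sorted by position.
TSS = [(1_000, "geneA"), (5_000, "geneB"), (12_000, "geneC")]
POSITIONS = [pos for pos, _ in TSS]

def nearest_gene(peak_start, peak_end):
    """Return (gene, distance) for the TSS closest to the peak midpoint."""
    mid = (peak_start + peak_end) // 2
    i = bisect.bisect_left(POSITIONS, mid)
    candidates = TSS[max(i - 1, 0):i + 1]  # TSS just left and just right of mid
    pos, gene = min(candidates, key=lambda t: abs(t[0] - mid))
    return gene, abs(pos - mid)

annotation = nearest_gene(4_600, 4_800)
```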

  • Correlation with expression data

  • Discovery of binding sequence motifs

  • Gene Ontology analysis on genes bound by the same factor or carrying the same modification

  • Correlation with SNP data to find allele-specific binding